Support Hadoop SequenceFiles Scan#14061
Conversation
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
This reverts commit 2ab5557.
There was a problem hiding this comment.
Pull request overview
This PR adds support for reading Hadoop SequenceFiles in the RAPIDS Accelerator for Apache Spark. It registers a new file format sequencefilebinary that reads SequenceFile key/value pairs as raw BinaryType columns on the GPU.
Key Changes
- Introduces
SequenceFileBinaryFileFormatas a new DataSource that reads SequenceFiles and exposes key/value as BinaryType columns - Implements GPU-accelerated reading via
GpuReadSequenceFileBinaryFormatand associated partition readers - Integrates the new format into
GpuFileSourceScanExecfor GPU execution path routing
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 14 comments.
Show a summary per file
| File | Description |
|---|---|
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormat.scala | CPU-side FileFormat implementation with row-based reader for SequenceFiles |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuReadSequenceFileBinaryFormat.scala | GPU-enabled FileFormat wrapper with metadata support and multi-file reader factory |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala | Core GPU partition readers with host-side buffering and device column materialization |
| sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuFileSourceScanExec.scala | Integration point registering SequenceFileBinary format in GPU scan execution |
| tests/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala | Test suite in tests module for wildcard discovery |
| sql-plugin/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala | Duplicate test suite in sql-plugin module |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormat.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormat.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Show resolved
Hide resolved
sql-plugin/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Show resolved
Hide resolved
tests/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
|
@Greptile full review |
Greptile SummaryThis PR adds GPU-accelerated reading of Hadoop SequenceFiles by introducing a new Key changes:
Issues found:
Confidence Score: 2/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant Driver
participant Meta as GpuSequenceFileSerializeFromObjectExecMeta
participant Exec as GpuSequenceFileSerializeFromObjectExec
participant Factory as GpuSequenceFileMultiFilePartitionReaderFactory
participant Reader as MultiFileCloudSequenceFilePartitionReader
participant Hadoop as SequenceFile.Reader (CPU)
participant GPU as cuDF / GPU
Driver->>Meta: tagPlanForGpu()
Meta->>Meta: isSimpleSequenceFileRDD() [reflection]
Meta->>Meta: isCompressedSequenceFile() [header peek]
Meta-->>Driver: willNotWorkOnGpu / OK
Driver->>Exec: internalDoExecuteColumnar()
Exec->>Factory: build reader factory
Exec->>Reader: GpuDataSourceRDD per FilePartition
loop For each PartitionedFile (threaded)
Reader->>Hadoop: SequenceFile.Reader.nextRaw()
Hadoop-->>Reader: raw key/value bytes
Reader->>Reader: HostBinaryListBufferer.addBytes()
end
alt combineConf active (small files)
Reader->>Reader: combineHMBs() — zero-copy chunk aggregation
end
Reader->>GPU: buildDeviceColumnFromChunks() → ColumnVector.concatenate()
GPU-->>Exec: ColumnarBatch (LIST<UINT8> key + value)
Exec-->>Driver: Iterator[ColumnarBatch]
|
There was a problem hiding this comment.
Additional Comments (4)
-
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala, line 125-128 (link)logic: check for potential INT32 overflow before it happens
the check happens after
dataLocationhas already grown beyondInt.MaxValue, which could cause issues during the buffer growth operations inaddBytesoraddValueBytes. move the overflow check earlier in those methods before updatingdataLocation. -
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala, line 73-83 (link)logic: potential integer overflow in row capacity calculation
rowsAllocated * 2can overflow whenrowsAllocatedis close toInt.MaxValue / 2. this causes the allocation to wrap to negative or small values. -
tests/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala, line 32-136 (link)style: missing test coverage for key scenarios
add tests for:
- compressed SequenceFiles (should throw
UnsupportedOperationException) - multi-file reads to verify the multi-file reader path
- large batches that exceed
maxRowsPerBatchormaxBytesPerBatch - partition columns
- reading only key or only value (not both)
- empty files
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
- compressed SequenceFiles (should throw
-
tests/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala, line 102-136 (link)style: tests only verify the CPU reader path, not GPU
this test uses
SequenceFileBinaryFileFormatwhich is the CPU fallback. to test the GPU path withGpuReadSequenceFileBinaryFormat, you'd need to enable the Rapids plugin configuration and verify GPU execution.
5 files reviewed, 4 comments
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 12 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormat.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormat.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuReadSequenceFileBinaryFormat.scala
Outdated
Show resolved
Hide resolved
tests/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala
Outdated
Show resolved
Hide resolved
tests/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 13 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
tests/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala
Outdated
Show resolved
Hide resolved
tests/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
105cf2a to
9b1162e
Compare
|
@greptileai full review |
|
@greptileai full review |
sql-plugin/src/main/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormat.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 10 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuReadSequenceFileBinaryFormat.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormat.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormat.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
|
I am a little concerned about the general direction of this PR. If the RDDScan is a problem, then perhaps we should be trying to understand how to address that in a generic way instead of then sequence file reader code here. My main concern with the sequence file reader code is that there is a code change required for a user. The no code change is a fundamental principle of this project and I am concerned that we are violating that. This makes it much harder for a customer to switch over and also hard for a customer to fall back to the CPU if they need to. |
The customer is open to changing how they read the files. But I totally agree that we should keep the no-code-change principle here. It should be possible to introduce a rule that converts the RDDScan to a GPU file reader without changing the user code. I will work on it. |
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
|
NOTE: release/26.02 has been created from main. Please retarget your PR to release/26.02 if it should be included in the release. |
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
|
@greptileai full review |
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
6fd0076 to
3fe2cd6
Compare
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
|
@greptileai full review |
Closes #14065
Description
This PR:
com.nvidia.spark.rapids.SequenceFileBinaryFileFormatto read Hadoop Sequence Files to GPU:
spark.read.format("com.nvidia.spark.rapids.SequenceFileBinaryFileFormat").load(logPath)Performance tests
val NUM_FILES = 200
val RECORDS_PER_FILE = 50000
val VALUE_SIZE = 1024
val ITERATIONS = 5
I ran performance tests on 200 files with 50,000 records and a 1 MB size per value.
script
Since the decode happens on CPU we got similar perf numbers with CPU file format and we need to copy data to GPU for GPU file format. But it's about 2 times faster than CPU RDD scan.
Checklists
(Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)